Abstract: The great prosperity of big data systems such as Hadoop in recent
years has made the benchmarking of these systems crucial for both research
and industry communities. The complexity, diversity, and rapid evolution of
big data systems give rise to various new challenges about how we design
generators to produce data with the 4V properties (i.e. volume, velocity,
variety and veracity), as well as implement application-specific but still
comprehensive workloads. However, most of the existing big data benchmarks
can be described as attempts to solve specific problems in benchmarking
systems. This article investigates the state-of-the-art in benchmarking big
data systems along with the future challenges to be addressed to realize a
successful and efficient benchmark.
1 Introduction

Big data systems have gained unquestionable success in
recent years and will continue their rapid development
over the next decade. These systems cover many industrial
and public service areas such as search engines, social
networks, e-commerce sites and multimedia, as well as
a variety of scientific research areas such as bioinformatics,
environment, meteorology, and complex simulations of
physics. Conceptually, big data are characterized by very
large data volume and velocity, high variety (diversity) in
data types and sources, and stringent requirements of data
veracity (fidelity). In the era of big data, the complexity
and diversity of big data systems and the emergence of new
systems driven by the exploration of big data values give
rise to new challenges in how to benchmark and understand
these systems successfully and efficiently. Enabling
such benchmarking is essential so that system designers,
programmers and researchers can optimize the performance
and energy efficiency of big data systems and promote the
development of big data technology.

Conceptually, a big data benchmark aims to generate
application-specific workloads and tests capable of processing
big data with the 4V properties (volume, velocity, variety
and veracity) [1] in order to produce meaningful evaluation
results [3]. In this article, we survey the state-of-the-art
of big data benchmarks in both academic and industry
communities and provide a foundation towards building
successful big data benchmarks. The remainder of this
paper is organized as follows. First of all, we present the
methodology and requirements of benchmarking big data
systems, which represent our insights into how to conduct
successful benchmarking (Section 2). Next, the data
generation techniques and the workload implementation
techniques are presented to identify the characteristics of
big data benchmarks. Following this, a number of current
benchmarks for evaluating big data systems are summarized
(Section 3). Finally, an open discussion of research
challenges is presented to stimulate productive thinking,
investigation, and development in this research area
(Section 4), and the paper is summarized in the final section.

2 Methodology and Requirements of 
Benchmarking Big Data Systems 

In this section, we first present some basic concepts of big
data systems (Section 2.1), after which the methodology
(Section 2.2) and requirements (Section 2.3) of benchmarking
such systems are introduced.

2.1 Big Data Systems 

Technically, a big data system can be characterized by the 
data with the 4V properties and the workloads taking the 
data as inputs. We first explain the 4V properties as follows. 

Volume. Volume represents the amount/size of data
such as TB, PB or ZB. Today, data are generated faster than
ever. For example, about 2.5 quintillion bytes of data are
created every day [1] and this speed is expected to increase
exponentially over the next decade according to International
Data Corporation (IDC). At Facebook, 350 million photos are
uploaded and more than 500 TB of data are generated per day.
Moreover, data volume has different meanings
in workloads used to process different data sources. For
example, in workloads for processing text data (e.g. Sort [3]
or WordCount), the volume is represented by the amount
of data (e.g. 1 TB or 1 PB text data). In social network
workloads such as connected component, the volume is
represented by the number of vertices (e.g. vertices) in
social graphs.

Velocity. Velocity reflects the speed of generating, updating,
or processing data. First of all, data velocity represents
the data generation rate, such as generating 10 TB per hour.
Secondly, many big data applications such as e-commerce
sites and social networks have continuously updating data.
In this case, data velocity represents the data updating
frequencies. Finally, in stream processing systems, data
streams must be processed in real time to keep up with
their arrival speed. Hence data velocity represents the data
processing speed.

Variety. Variety denotes the range of data types and
sources. The fast development of big data systems gives
birth to a diversity of data types, which cover structured
data (e.g. tables), unstructured data (e.g. text, images,
audio, and videos), and semi-structured data (e.g. graphs and
web logs).

Veracity. Veracity reflects whether the data used in big
data systems conform to the inherent and important
characteristics of raw data. This property is important to guarantee
the realism and credibility of benchmarking results.

Moreover, today's big data systems have a diversity of
workloads whose behaviors (e.g. resource demands) are
determined by two factors: (1) Computation semantics. This
factor decides the implementation logic of workloads. For
example, three Hadoop analytics workloads (Sort, WordCount
and Grep) have different computation semantics (source
code) and thus different resource demands: Sort is
an I/O-intensive workload, WordCount is a CPU-intensive
workload dominated by integer calculations, and Grep
has similar demands for CPU and I/O resources. (2) Software
stacks. A software stack consists of a set of programs
working together to provide a fully functional solution.
Modern big data software stacks such as Hadoop [4] and
Spark [5] usually provide rich libraries to facilitate the
development of new applications. Hence a programmer
can focus on writing a few lines of code to implement
an application (e.g. the Map and Reduce functions of a
Hadoop MapReduce application) and leave parallelization,
job scheduling, fault tolerance, and other tasks to the
software stack. Software stacks, therefore, have a considerable
impact on a workload's behavior. For example, the Hadoop
WordCount workload has similar behaviors to the Hadoop
Bayes classification workload, but has different behaviors
from the Spark WordCount workload |i6|. 
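
To illustrate the division of labor described above, the following is a minimal sketch of WordCount in plain Python rather than on Hadoop itself. The two user-written functions, map_fn and reduce_fn, carry the computation semantics; run_job is a hypothetical, single-process stand-in for everything a real software stack would provide (all three names are assumptions made for this illustration).

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """User-written Map function: emit (word, 1) for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """User-written Reduce function: sum the counts gathered for one word."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Toy stand-in for the software stack: map, group by key, then reduce.

    A real stack would also parallelize, schedule jobs, and handle faults.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)  # the "shuffle" step
    return dict(reduce_fn(key, values) for key, values in groups.items())


counts = run_job(["big data systems", "big data benchmarks"])
```

Only map_fn and reduce_fn would change for a different workload such as Grep, which is why the same application logic can behave very differently once it runs on a different stack.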

2.2 Benchmarking Methodology 

Big data benchmarks are developed to evaluate and compare
the performance of big data systems and architectures.
Successful and efficient benchmarking can provide realistic
and accurate measurements of big data systems, thereby
addressing two objectives. (1) Promoting the development
of big data technology, i.e. developing new architectures
(processors, memory systems, and network systems), innovative
theories, algorithms and techniques to manage big
data and extract their value and hidden knowledge. (2)
Assisting system owners in making decisions for planning
system features, tuning system configurations, validating
deployment strategies, and conducting other efforts to improve
system performance. For example, benchmarking
results can identify the performance bottlenecks in big data
systems, thus guiding the optimization of system configuration
and resource allocation.

Figure 1 shows a typical benchmarking methodology
for big data systems, which consists of five stages. After
selecting the application domain at stage 1, stage 2 surveys
the representative applications in this domain. From
these applications, this stage identifies data models from
real data, data operations and workload patterns from real
workloads, and evaluation metrics from performance and
cost indicators. Based on the identification results, stage 3
implements data generation tools to produce data sets with
the 4V properties and implements workloads to support
application-specific benchmarking tests. Subsequently, stage
4 determines the target system, and prepares the input
data and the benchmarking prescription used to test this
system. A prescription includes all the information needed to
produce a benchmarking test, including input data, workloads,
a method to generate tests, and the evaluation metrics.
Finally, the benchmark test is conducted at stage 5 and the
results are analyzed and evaluated.
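
As a rough illustration of what a prescription bundles together, the sketch below models its four parts (input data, workloads, test generation method, and metrics) as a small Python data class. The class name, field names, and example values are hypothetical and chosen only for this illustration; the paper does not define a concrete format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Prescription:
    """Hypothetical container for the four parts of a prescription."""
    input_data: str = ""                               # e.g. a data set name
    workloads: List[str] = field(default_factory=list)
    test_method: str = ""                              # how tests are generated
    metrics: List[str] = field(default_factory=list)

    def missing_parts(self) -> List[str]:
        """Return the names of the parts that are still empty."""
        return [name for name, value in vars(self).items() if not value]


complete = Prescription("text-1TB", ["WordCount"], "single run", ["duration"])
draft = Prescription(input_data="text-1TB", workloads=["Sort"])
```

A check such as missing_parts could be run at stage 4, before any test is executed at stage 5, to confirm that the prescription is complete.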


[Figure 1: flowchart of the five stages. Stage 1: select application
domain. Stage 2: identify data models; identify data operations and
workload patterns; identify metrics. Stage 3: implement data generation
tools; implement workloads on different software stacks. Stage 4:
determine which system to benchmark; prepare input data and benchmarking
prescriptions. Stage 5: execute tests; analyze and evaluate results.]

Fig. 1. The five-stage benchmarking methodology for big data systems


2.3 Benchmarking Requirements 

In this section, we discuss the requirements of performing 
a fair, efficient and successful benchmarking test of big data 
systems. 

2.3.1 Generating Data with the 4V Properties
Some traditional benchmarks use real data as inputs of their
workloads and thereby guarantee data veracity. However,
the volume and velocity of real data cannot be flexibly
adapted to different benchmarking requirements. Based on
our experience, we also noticed that in many practical
scenarios, obtaining a variety of real data is difficult because
many data owners are not willing to share their data due
to confidentiality concerns. In big data benchmarks, therefore,
the consensus is to generate synthetic data as inputs of
workloads on the basis of real data sets. Hence in synthetic
data generation, preserving the 4V properties of big data is
the foundation of producing meaningful and credible evaluation
results. We discuss the data generation requirements
based on the typical data generation process shown in
Figure 2.

At the first step, the data generation tools support the 
variety of big data by collecting real data to cover different 
data sources and types. The tools can also generate synthetic 
data sets directly. This is because in practice such purely 
synthetic data can be used as inputs of some workloads such 
as the Sort and WordCount in Micro benchmarks, and the 
Read, Write, and Scan belonging to basic database operations.

TECHNICAL REPORT. ICT, ACS

At the second step, each tool employs a data model to
capture and preserve the important characteristics in one or
multiple real data sets of a specific data type. For example,
a text generator can apply latent Dirichlet allocation (LDA)
[7] to describe the topic and word distributions in text data.
This generator first learns from a real text data set to obtain
a word dictionary. It then trains the parameters of an LDA
model using this data set. Finally, it generates synthetic text
data using the LDA model. To preserve data veracity,
different models should be developed to
capture the characteristics of real data of different types such
as table, text, stream, and graph data.
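
The final generation step of such a text generator can be sketched as follows. This is a minimal illustration of the standard LDA generative process (draw a topic mixture from a Dirichlet distribution, then draw a topic and a word for each position), not the generator of any particular benchmark. The vocabulary, the topic-word table TOPIC_WORD, and the prior alpha are made-up stand-ins for parameters that would in practice be trained on a real text data set.

```python
import random
from typing import List, Sequence


def sample_dirichlet(rng: random.Random, alpha: Sequence[float]) -> List[float]:
    """Draw a topic mixture by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng: random.Random, alpha: Sequence[float],
                      topic_word: Sequence[Sequence[float]],
                      vocabulary: Sequence[str], length: int) -> List[str]:
    """Generate one synthetic document from fixed (assumed) LDA parameters."""
    theta = sample_dirichlet(rng, alpha)       # per-document topic mixture
    topics = list(range(len(alpha)))
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=theta)[0]
        words.append(rng.choices(vocabulary, weights=topic_word[topic])[0])
    return words


# Hypothetical parameters; a real generator would learn these from raw text.
VOCABULARY = ["data", "benchmark", "graph", "vertex"]
TOPIC_WORD = [[0.5, 0.4, 0.05, 0.05], [0.05, 0.05, 0.5, 0.4]]
doc = generate_document(random.Random(7), [0.5, 0.5], TOPIC_WORD, VOCABULARY, 20)
```

Because the word and topic distributions come from the trained model, the synthetic text can be scaled to any volume while keeping the statistical shape of the raw data.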

At the third step, the volume and velocity can be controlled
according to user requirements. For example, the
data generation can be parallelized and distributed to multiple
machines, thus supporting different data generation
rates.
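
The arithmetic behind this parallel strategy is simple and can be sketched as below. The function names and the per-generator rate of 50 MB/s are assumptions made for illustration only; the target of roughly 2912.7 MB/s corresponds to the 10 TB per hour example given in Section 2.1.

```python
import math
from typing import List


def generators_needed(target_rate_mb_s: float, per_generator_mb_s: float) -> int:
    """Number of parallel generators required to reach a target rate."""
    if target_rate_mb_s <= 0 or per_generator_mb_s <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(target_rate_mb_s / per_generator_mb_s)


def partition_volume(total_records: int, n_generators: int) -> List[int]:
    """Split the requested volume as evenly as possible across generators."""
    base, extra = divmod(total_records, n_generators)
    return [base + (1 if i < extra else 0) for i in range(n_generators)]


n = generators_needed(2912.7, 50.0)   # about 10 TB per hour
shares = partition_volume(10, 3)
```

Under these assumptions, volume is set by the total record count and velocity by the number of generators deployed.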

At the final step, after a data set is generated, the format 
conversion tools transform this data set into an appropriate 
format capable of being used as the input of a workload 
running on a specific system. 



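[Figure 2: the data generation pipeline, with its steps annotated by the
properties they address: Variety, Veracity, and Volume and Velocity.]

Fig. 2. The big data generation process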

2.3.2 Implementing Application-specific Workloads 
Margo Seltzer et al. pointed out that a benchmarking test
is meaningful only when applying an application-specific
workload [2]. However, the diversity and rapid evolution
of big data systems mean that it is challenging to develop big
data benchmarks that reflect various workload cases. Hence
in big data benchmarks, identifying the typical workload
behaviors for an application domain is the prerequisite of
implementing workloads to evaluate big data systems. We
discuss the workload implementation requirements from
the functional perspective and the system perspective,
respectively.

Functional perspective. Given the complexity and diversity
of workload behaviors in current big data systems,
it is reasonable to say that no single set of behaviors is
representative for all applications. Hence, it is necessary to
abstract from the behaviors of different workloads to a general
approach. This approach should identify representative
workload behaviors in the application domain. In big data
processing, the workload behaviors can be described as a
set of operations and workload patterns.

• Operations represent the abstracted processing actions
(operators) on data sets. For example, select,
put, get, and delete are the identified operations in
database systems to operate on table data.

• Workload patterns are designed to combine operations
to form complex processing tasks. One identified
workload pattern can contain one or multiple abstract
operations as well as their workflow. For example,
a workload pattern representing a SQL query
can contain select and put operations, in which the
select operation executes first.
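
The relationship between operations and workload patterns can be sketched as follows, using a list of dictionaries as a stand-in for table data. The helper names (select, put, run_pattern) and the example rows are assumptions for illustration; the point is only that a pattern is an ordered composition of abstract operations.

```python
from typing import Callable, Dict, List

Table = List[Dict[str, int]]
Operation = Callable[[Table], Table]


def select(predicate: Callable[[Dict[str, int]], bool]) -> Operation:
    """Abstract select operation: keep the rows that satisfy a predicate."""
    return lambda table: [row for row in table if predicate(row)]


def put(row: Dict[str, int]) -> Operation:
    """Abstract put operation: append one row to the table."""
    return lambda table: table + [row]


def run_pattern(pattern: List[Operation], table: Table) -> Table:
    """A workload pattern is an ordered workflow of operations."""
    for operation in pattern:
        table = operation(table)
    return table


orders = [{"id": 1, "price": 5}, {"id": 2, "price": 50}]
# Pattern in which the select operation executes first, then put.
pattern = [select(lambda row: row["price"] > 10), put({"id": 3, "price": 20})]
result = run_pattern(pattern, orders)
```

Nothing in this description refers to a particular system, which is what allows the same pattern to be implemented later on different software stacks.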

System perspective. The identified operations and
patterns are designed to capture workloads'
system-independent behaviors, i.e. the data processing operations
and their sequences. Based on abstracted operations and
patterns, an abstract workload can be constructed, and
this workload is independent of underlying systems. From
the system perspective, this abstract workload can be
implemented on different software stacks (MPI, Hadoop and
Spark) and thereby allows the comparison of systems of
different types. For example, an abstract workload consisting
of a sequence of read, write, and update operations can be
used to compare a DBMS and the Hadoop MapReduce
system.

2.3.3 Benchmarkability Requirements 

The requirements of both data and workloads decide the 
worthiness of benchmarking results. In this section, we 
briefly discuss a list of benchmarkability requirements that 
decide the effectiveness of benchmarking big data systems. 

Usability. Usability reflects users' experiences in using
benchmarks and it is a combination of factors. In big data
benchmarks, these factors include ease of deploying,
configuring, and using across different software stacks; high
benchmarking efficiencies; simple and understandable
performance metrics; convenient user interfaces; and so on.

Fair measurement. A fair and sensible evaluation has a
twofold meaning. First, big data systems usually have
many optional configurations, and the combination of
configurations that yields optimal performance differs when
the systems run on different hardware platforms. That is
to say, using default configurations in measurement cannot
guarantee fair measurement. Hence when comparing different
big data systems on heterogeneous platforms, each system
must be configured separately for fair comparison. For
example, a big data system may have a specific configuration
that improves its own performance; applying this
configuration to other systems is therefore not suitable for
fair measurement. Second, repeatability is another important
requirement of fair measurement.
This requirement means the parameters of hardware and
software configurations must be stated clearly so that the
same result can be obtained when the evaluation is repeated
several times. In particular, in cloud environments, there are
multiple virtual machines (VMs) running on one physical
machine (PM) and competing for compute resources. Hence
we need to develop a comprehensive evaluation mechanism
to effectively identify and estimate the impact of resource
competition on benchmarking results, and try to avoid the
uncertainties incurred by this resource competition.

Measurability. Metrics are crucial for quantifying, analyzing
and evaluating benchmarking results. In big data systems,
metrics (either single or multiple metrics) can typically
be divided into two types: user-perceivable metrics and 
architecture metrics m. User-perceivable metrics represent 
the metrics that matter for users; these metrics are usually
observable and easy for users to understand. Examples of
user-perceivable metrics are the duration of a test, request
latency, and throughput. While user-perceivable metrics are
used to compare the performance of workloads of the same
category, architecture metrics are designed to compare
workloads from different categories. Examples of architecture
metrics are million instructions per second (MIPS) and
million floating-point operations per second (MFLOPS). In
addition, these metrics should not only measure system
performance, but also take energy consumption and cost efficiency
into consideration.
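
The two kinds of metrics can be computed from raw measurements along the following lines. This is a minimal sketch: the function names are assumptions, the latency percentile uses the nearest-rank definition as one common choice, and the instruction count fed to mips would in practice come from hardware performance counters.

```python
import math
from typing import Sequence


def throughput(completed_requests: int, duration_s: float) -> float:
    """User-perceivable metric: requests completed per second."""
    return completed_requests / duration_s


def percentile_latency(latencies_ms: Sequence[float], percent: float) -> float:
    """User-perceivable metric: nearest-rank percentile of request latency."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def mips(instructions: int, duration_s: float) -> float:
    """Architecture metric: million instructions executed per second."""
    return instructions / (duration_s * 1_000_000)


requests_per_second = throughput(1000, 20.0)
median_ms = percentile_latency([40, 10, 30, 20], 50)
rate_mips = mips(3_000_000_000, 2.0)
```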

Extensibility. The fast evolution of big data systems
requires big data benchmarks not only to keep pace with
state-of-the-art techniques and underlying systems, but also
to take their future changes into consideration. That is, big
data benchmarks should be able to add new workloads
or data sets with little or no change to the underlying 
algorithms and functions H). 

3 State-of-the-art 

In this section, we review related work on big data benchmarks
from the perspectives of data generation and workload
implementation.

3.1 Data Generation Techniques 

We review data generation techniques in existing big data 
benchmarks according to the 4V properties of big data, as 
shown in Table 1. 

Volume. Most existing benchmarks generate scalable
data as their workload inputs. By contrast, some benchmarks
such as HiBench and LinkBench use both scalable and
fixed-size data as inputs. Hence we call these benchmarks
partially scalable in terms of data volume.

Velocity. To date, data velocity has not been adequately
addressed in current benchmarks. Some benchmarks such as
LinkBench, CloudSuite and BigDataBench provide parallel
strategies to enable the dynamic adjustment of data generation
speed. However, two other equally important aspects
of data velocity, namely the data updating and processing
speeds, are not considered in these benchmarks. That is,
these benchmarks are semi-controllable in terms of data
velocity. Other benchmarks are classified as un-controllable
because they do not consider any aspect of data velocity.

Variety. Table 1 lists the data types and sources supported
by each benchmark. These include structured data
(tables), unstructured data (text, images, videos and audio),
and semi-structured data (graphs, web logs and resumes).
We can observe that many benchmarks only support one
type of data (e.g. the unstructured text data in HiBench
or the structured table data in YCSB and TPC-DS). Some
benchmark suites support multiple data types, and
BigDataBench supports the largest number of data sources.

Veracity. Veracity is one of the most challenging aspects
of data generation. In many benchmarks such as GridMix,
SWIM, HiBench and YCSB, the generation process of synthetic
data is independent of real raw data. For example,
in HiBench [9] the synthetic data sets are either randomly
generated using the programs in the Hadoop distribution
or created using some statistical distributions. Data veracity
therefore is not considered in these benchmarks. By contrast,
some benchmarks such as TPC-DS, LinkBench and
BigDataBench capture and preserve the important characteristics
of real data by identifying data models. The
synthetic data are then generated using the constructed
data model, thus partially considering data veracity. For
example, TPC-DS [flOl implements a multi-dimensional data 
generator (MUDD). MUDD generates most of data using 
traditional synthetic distributions such as a Gaussian distribution.
On the other hand, MUDD generates a small portion
of crucial data sets using more realistic distributions derived
from real data. Moreover, in BigDataBench [8], different data
models are employed to capture and preserve the important
characteristics of raw data of different types (e.g. table, text,
and graph).

3.2 Workload Implementation Techniques 

In the context of benchmarking big data systems, many
existing big data benchmarks implement workloads to evaluate
specific types of systems or architectures. As listed in
Table 2, Sort, DFSIO, MRBench, GridMix, PigMix and
SWIM are designed for Hadoop systems, while LinkBench and
TPC-DS implement workloads for testing DBMSs. HiBench,
CALDA and BigBench are developed to compare the performance
of Hadoop with that of Hive (or DBMSs). Specifically,
CALDA compares two parallel SQL DBMSs (i.e. DBMS-X
and Vertica) with Hadoop [17]. The same workloads are
used to test four typical SQL-driven systems, including one
database (Redshift), one data warehousing system (Hive),
and two engines (Spark and Impala) [18]. TPC-DS is TPC's
latest decision support benchmark [20], designed to test
the performance of DBMSs in decision support systems.
Developed from TPC-DS by adding a web log generator and
a review generator, BigBench aims to compare the Teradata
Aster DBMS and Hadoop [10].

Some other benchmarks target NoSQL databases or
architectures. The Yahoo! Cloud Serving Benchmark
(YCSB) compares two non-relational
databases (Cassandra and HBase) against one geographically
distributed database (PNUTS) and a traditional relational
database (MySQL) [22]. The CloudSuite benchmark
[23] is implemented to test cloud service architectures. To
the best of our knowledge, BigDataBench is the only big
data benchmark that implements a comprehensive suite
of workloads covering micro benchmarks, three dominant
internet services (i.e. Search Engine, Social Network, and
E-commerce), and two fast-emerging big data domains
(multimedia and bioinformatics).

From the perspective of benchmarking users, Table 2
roughly divides the current big data workloads into two
types. (1) Online services: these workloads are sensitive to
the response delay, i.e. the time interval between the arrival
and departure moments of a service request. Examples of
workloads belonging to this category are OLTP and web
search queries. (2) Offline analytics: these workloads usually
perform complex and time-consuming computations on
big data. Examples of workloads belonging to this category
are machine learning algorithms such as k-means clustering
and Naive Bayes classification.

TABLE 1

| Benchmark efforts | Data Volume | Data Velocity | Data Variety: data types | Data Variety: data sources | Data Veracity |
|---|---|---|---|---|---|
| Sort [11] | Scalable | Un-controllable | Unstructured data | Texts | Un-considered |
| DFSIO [12] | Scalable | Un-controllable | Unstructured data | Texts | Un-considered |
| MRBench [13] | Scalable | Un-controllable | Structured data | Tables | Un-considered |
| GridMix [14] | Scalable | Un-controllable | Unstructured data | Texts | Un-considered |
| PigMix [15] | Scalable | Un-controllable | Unstructured data | Texts | Un-considered |
| SWIM [16] | Scalable | Un-controllable | Unstructured data | Texts | Un-considered |
| HiBench [9] | Partially scalable | Un-controllable | Unstructured data | Texts | Un-considered |
| CALDA [17] | Scalable | Un-controllable | Structured and unstructured data | Tables, texts | Un-considered |
| AMPLab benchmark [18] | Scalable | Un-controllable | Structured and unstructured data | Tables, texts | Un-considered |
| BigBench [10] | Scalable | Semi-controllable | Structured, semi-structured and unstructured data | Tables, web logs and texts | Partially considered |
| LinkBench [19] | Partially scalable | Semi-controllable | Semi-structured data | Graphs | Partially considered |
| TPC-DS [20] | Scalable | Semi-controllable | Structured data | Tables | Partially considered |
| BG benchmark [21] | Scalable | Semi-controllable | Structured data | Tables | Un-considered |
| YCSB [22] | Scalable | Un-controllable | Structured data | Tables | Un-considered |
| CloudSuite [23] | Partially scalable | Semi-controllable | Structured, semi-structured and unstructured data | Tables, resumes, graphs and texts | Partially considered |
| BigDataBench [8] | Scalable | Semi-controllable | Structured, semi-structured and unstructured data | Tables, resumes, graphs, texts, images, videos and audios | Partially considered |


TABLE 2

Comparison of implemented workloads and supported software stacks in existing big data benchmarks.

| Benchmark efforts | Workload type | Operations | Software stacks |
|---|---|---|---|
| Sort [11] | Offline analytics | Sort | Hadoop |
| DFSIO [12] | Offline analytics | Generate, read, write, append, and remove data for MapReduce jobs | Hadoop |
| MRBench [13] | Online services | MapReduce jobs transformed from 22 TPC-H queries | Hadoop |
| GridMix [14] | Online services | Sort, sampling a large dataset | Hadoop |
| PigMix [15] | Online services | 12 data queries | Hadoop |
| SWIM [16] | Offline analytics | Synthetic MapReduce jobs of reading, writing, shuffling and sorting data | Hadoop |
| HiBench [9] | Offline analytics | Sort, WordCount, TeraSort, PageRank, K-means, Bayes classification, Index | Hadoop and Hive |
| CALDA [17] | Online services | Load, scan, select, aggregate and join data, count URL links | Hadoop and DBMSs |
| AMPLab benchmark [18] | Online services | Part of CALDA workloads (scan, aggregate and join) and PageRank | Redshift, Hive, Shark, Impala and Tez |
| BigBench [10] | Online services | Database operations (select, create and drop tables) | Hadoop and DBMSs |
| | Offline analytics | K-means, classification | |
| LinkBench [19] | Online services | Database operations such as select, insert, update, and delete; association range queries and count queries | DBMS |
| TPC-DS | Online services | Data loading, queries and maintenance | DBMS |
| BG benchmark [21] | Online services | Reading and updating databases | DBMS and NoSQL systems |
| YCSB [22] | Online services | OLTP (read, write, scan, update) | NoSQL systems |
| CloudSuite [23] | Online services | YCSB workloads | NoSQL systems, Hadoop, GraphLab |
| | Offline analytics | Text classification, WordCount | |
| BigDataBench [8] | Online services | Database operations (read, write, scan) | Hadoop, DBMSs, NoSQL systems, Hive, Impala, HBase, MPI, Shark, Libc, and other real-time analytics systems |
| | Offline analytics | 1. Micro benchmarks (sort, grep, WordCount, CFS); 2. Search engine workloads (index, PageRank); 3. Social network workloads (connected components (CC), K-means and BFS); 4. E-commerce site workloads (relational database queries (select, aggregate and join), collaborative filtering (CF) and Naive Bayes); 5. Multimedia analytics workloads (BasicMPEG, SIFT, DBN, Speech Recognition, Ray Tracing, Image Segmentation, Face Detection); 6. Bioinformatics workloads (SAND and BLAST) | |


4 Research Challenges

Most of the existing works related to benchmarking big
data systems can be viewed as attempts to solve specific
benchmarking problems. To date, many aspects of benchmarking
big data systems remain unexplored. Considering the
emergence of new big data applications and the rapid evolution
of big data systems, we believe an incremental and
iterative approach is necessary to conduct the investigations
on big data benchmarks. In this section, we summarize some
of the research challenges to be addressed for successful
benchmarking.


4.1 Generating Data with Velocity and Veracity Properties

One fundamental challenge of successfully benchmarking
big data systems is generating data with the 4V
properties. While data volume and variety are
well supported in today's benchmarks, how to generate
data with velocity and veracity properties has not yet
been adequately solved.

Controllable data velocity. This controllability has two
meanings. First, existing big data benchmarks only consider
different data generation speeds. Different data updating
frequencies and processing speeds should be reflected in
future big data generators. Second, data velocity is currently
supported using parallel strategies; that is, data velocity can
be controlled by deploying different numbers of parallel
data generators. To support the control mechanism at a
finer level of granularity, future generators can control data
velocity by adjusting the efficiency of the data generation
algorithms themselves. For example, in a graph data generator
running on Spark (an in-memory computing platform),
the data generation speed can be controlled by adjusting the
allocated memory resources.

Metrics to evaluate data veracity. As discussed in
Section 3.1, applying data models to capture and preserve
important characteristics of real data is an efficient way to
keep data veracity in synthetic data generation. However,
how to measure the conformity of the generated synthetic
data to the raw data is still an open question; that is, metrics
need to be developed to evaluate data veracity, thus guiding
the improvement and optimization of the constructed
models. Two types of evaluation metrics can be developed:
(1) metrics to compare the raw data and the constructed
data models; (2) metrics to compare the raw data and the
synthetic data.

This problem is compounded when considering different
data types and sources. For example, to compare a real text
data set with synthetic data, we first need to derive the topic
and word distributions from these data sets. Next, statistical
metrics such as Kullback-Leibler divergence can be applied
to measure the similarity between the two distributions.
Similarly, when considering table, graph or even stream data,
some other metrics should be developed.
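
For the text example, the comparison can be sketched on word distributions alone (topic distributions would be handled in the same way). The Kullback-Leibler divergence of Q from P is D(P || Q) = sum over w of P(w) log(P(w) / Q(w)), which is zero when the two distributions coincide. In the sketch below, the function names, the add-one smoothing (used so that no probability is zero), and the tiny token lists are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def word_distribution(tokens: Sequence[str], vocabulary: Sequence[str],
                      smoothing: float = 1.0) -> List[float]:
    """Smoothed relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return [(counts[w] + smoothing) / total for w in vocabulary]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D(P || Q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


vocabulary = ["data", "benchmark", "graph"]
real = "data data benchmark graph".split()
synthetic = "data benchmark benchmark graph".split()
p = word_distribution(real, vocabulary)
q = word_distribution(synthetic, vocabulary)
divergence = kl_divergence(p, q)
```

A smaller divergence indicates that the synthetic data follow the raw data more closely, so the value could serve as one candidate veracity metric of the second type listed above.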


4.2 Identifying Dwarf Big Data Workloads

The fast development of big data systems has led to a
number of successful application domains such as scientific
analytics, search engines, social networks, and stream
processing. Each of these application domains is the focus
of one or multiple big data platform efforts. A successful
big data benchmark, therefore, should provide workloads
with representativeness and wide coverage. However, the
complexity and diversity of big data systems impose great
challenges on workload selection, as it is impractical to
implement all big data workloads. Within this context,
identifying a minimal set of dwarf workloads to represent
diverse big data workloads provides an effective approach
[25]. These dwarf workloads have two appealing features.
First, they represent high-level abstractions of the computation
(data operations), communication and workload patterns
frequently appearing in big data processing. Second, they
compose a minimum set of necessary functionality [26], thus
providing guidelines for performance optimization [25].

Considering the diversity of big data domains and the
heterogeneity of big data systems, we summarize the challenges
in identifying dwarf workloads as follows. (1) As
big data systems are applied in more and more domains,
it is difficult to cover all these domains. (2) Even in some
popular domains such as big data analytics, identifying the 
frequently appearing operations from a wide range of data 
mining and machine learning algorithms is not trivial. (3) 
The problem is compounded when considering the variety 
of data in big data systems, in which 80% of operations are 
conducted on unstructured data fI7\. 
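
To make challenge (2) more concrete, the sketch below counts how often
abstract operations occur across a set of algorithms and keeps those
shared by a given fraction of them. The mapping from algorithms to
operations is a hypothetical, hand-labelled example rather than a
validated taxonomy, and the function name frequent_operations and the
min_share threshold are assumptions introduced for illustration.

```python
from collections import Counter

# Hypothetical, hand-labelled mapping from algorithm to abstract operations.
ALGORITHM_OPERATIONS = {
    "k-means": {"distance computation", "aggregation", "iteration"},
    "pagerank": {"matrix-vector multiplication", "aggregation", "iteration"},
    "naive bayes": {"counting", "aggregation"},
    "word count": {"counting", "aggregation"},
    "sort": {"comparison", "partitioning"},
}


def frequent_operations(algorithm_ops, min_share=0.5):
    """Return (operation, share) pairs for operations that are used by at
    least `min_share` of the algorithms, most frequent first."""
    counts = Counter(op for ops in algorithm_ops.values() for op in ops)
    n = len(algorithm_ops)
    return sorted(
        ((op, c / n) for op, c in counts.items() if c / n >= min_share),
        key=lambda item: (-item[1], item[0]),
    )


for operation, share in frequent_operations(ALGORITHM_OPERATIONS, 0.4):
    print(f"{operation}: used by {share:.0%} of the algorithms")
```

In practice the difficult part is producing the mapping itself, which
requires analyzing each algorithm; the counting step shown here is the
trivial remainder.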

4.3 Automating Test Generation 

In practice, generating benchmarking tests from the identified
dwarf workloads, including data operations and workload
patterns, may be beyond the capabilities of average
benchmark users. Hence, going mainstream with this framework
requires the development of an environment that
supports benchmark users in profiling the behaviors of
applications in the target domain, and that offers a repository
of reusable data models, operations and workload patterns
to simplify the generation of input data, workloads and tests
running on state-of-the-art software stacks.

Moreover, the automatic generation of benchmarking
tests also needs a parameterized framework that enables
flexible adjustment of input data and workload generation,
thus meeting the requirements of different benchmarking
scenarios. Developing such a framework requires
addressing two challenges. (1) Based on the identified data
models and workload operations, the framework should
derive optional parameters that allow users to choose different
parameter configurations to customize their input data and
workload generation. (2) The framework should study and
quantify the relationship between input data
and workloads, thus imposing constraints on the provided
parameters to guarantee a reasonable match between input
data and workload generation.
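
The two requirements above can be illustrated with a small
configuration object: users set the parameters, and a validation step
enforces constraints that tie the input data to the workload. This is a
sketch under stated assumptions only; the class BenchmarkConfig, its
fields, and the SUPPORTED_DATA_TYPES table are hypothetical and do not
describe any existing benchmark.

```python
from dataclasses import dataclass

# Hypothetical constraint table: data types each workload can consume.
SUPPORTED_DATA_TYPES = {
    "wordcount": {"text"},
    "pagerank": {"graph"},
    "sql_join": {"table"},
}


@dataclass(frozen=True)
class BenchmarkConfig:
    workload: str        # name of the workload operation to run
    data_type: str       # "text", "graph" or "table"
    data_size_gb: float  # volume of the generated input data
    num_tasks: int       # degree of parallelism

    def validate(self):
        """Raise ValueError if the parameters do not form a sensible test."""
        allowed = SUPPORTED_DATA_TYPES.get(self.workload)
        if allowed is None:
            raise ValueError(f"unknown workload: {self.workload}")
        if self.data_type not in allowed:
            raise ValueError(
                f"{self.workload} cannot run on {self.data_type} data"
            )
        if self.data_size_gb <= 0 or self.num_tasks <= 0:
            raise ValueError("data size and task count must be positive")
        return self


config = BenchmarkConfig("pagerank", "graph", data_size_gb=10.0, num_tasks=8)
print(config.validate())
```

A mismatched request such as BenchmarkConfig("pagerank", "text", 10.0, 8)
is rejected by validate() before any data or workload is generated.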

4.4 Characterizing Big Data Workloads

Understanding and characterizing workloads provides the
foundation for optimizing and promoting big data systems.
A typical example is system architecture design
driven by advances in processor design, in which significant
effort is devoted to workload characterization
in order to obtain optimization guidelines or implications
for next-generation processors [8], [23], [28], [29]. Within the
context of big data systems, there are three major challenges
in workload characterization.
Obtaining performance data. For traditional workloads
such as PARSEC [30], architecture simulators are typically
used to acquire fine-grained performance data in order to
characterize these workloads from the perspective of the
micro-architecture. Such simulation-based workload characterization
is prohibitively expensive for big data workloads. This
is because big data workloads typically process data through
deep software stacks, which means that completing one workload
may take thousands of billions of instructions, and
this processing can take several months when running on
simulators. Hardware performance counters offer a cost-efficient
way to obtain the processor's predefined
micro-architecture-level statistics, such as cache misses and the
number of retired instructions. However, they provide only
coarse-grained, architecture-dependent data, and lack
support for fine-grained performance data such as cache false
sharing. In addition, Dimitrov et al. obtain fine-grained
data by analyzing memory DIMM traces, but
this data acquisition relies on special hardware. Moreover,
when characterizing big data workloads from other perspectives
such as the memory and storage subsystems, there
exist similar challenges in obtaining fine-grained performance
data without simulation or special hardware support,
which prevents a thorough understanding of workload
behaviors.
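
The cost of simulation can be estimated with simple arithmetic. The
sketch below assumes a workload of two trillion instructions and a
detailed simulator that executes about 200,000 simulated instructions
per second; both figures are illustrative assumptions rather than
measurements, and the function name simulation_days is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulation_days(instructions, simulated_instr_per_sec):
    """Wall-clock days needed to simulate `instructions` instructions."""
    return instructions / simulated_instr_per_sec / SECONDS_PER_DAY


# Assumed figures, for illustration only.
workload_instructions = 2e12  # "thousands of billions" of instructions
simulator_speed = 2e5         # assumed detailed-simulator speed (instr/s)

days = simulation_days(workload_instructions, simulator_speed)
print(f"{days:.0f} days (about {days / 30:.1f} months)")
```

Under these assumptions a single workload needs roughly 116 days of
simulation, which is consistent with the several months mentioned above.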

Determining a suitable data volume. In big data workloads,
the acquired performance data vary with data volumes
[32], [33], and thus data volumes affect the results
of workload characterization. In big data systems, data
volumes depend on a number of factors, including the scale
of the cluster, the configuration of each node, and the tested
workloads. Hence, a range of possible data volumes exists
when testing a workload on a specific cluster. How to
determine a suitable data volume is still an open question.

Subsetting big data workloads. Today's big data systems
have a large number of workloads, and these workloads
change frequently, a phenomenon called workload
churn [34]. This means that even on real hardware (rather
than simulators), the evaluation can be very time
consuming. It is therefore essential to develop effective
approaches to remove redundant workloads and generate
a subset of workloads of manageable size, while still
keeping it representative of the whole workload set.
Jia et al. propose an approach that derives a subset of
representative workloads from the perspective of the
micro-architecture [35]. How to subset workloads from other
perspectives still needs to be investigated. Furthermore, how
to adapt to workload churn (that is, new workloads
being continuously added) in workload subsetting is another
challenge to be addressed.
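
One simple way to remove redundant workloads is to describe each
workload by a vector of measured metrics and keep only workloads that
are sufficiently far apart. The sketch below uses greedy "leader"
clustering for this purpose. It is not the method of Jia et al.; the
metric values, the choice of metrics, the radius and the function names
are hypothetical and serve only as an illustration.

```python
import math


def normalize(vectors):
    """Scale each metric (column) to [0, 1] so that no metric dominates."""
    columns = list(zip(*vectors.values()))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return {
        name: [(x - lo) / span for x, lo, span in zip(vec, lows, spans)]
        for name, vec in vectors.items()
    }


def select_representatives(vectors, radius):
    """Greedy leader clustering: keep a workload only if no workload that
    was selected earlier lies within `radius` of it."""
    scaled = normalize(vectors)
    leaders = []
    for name, vec in scaled.items():
        if all(math.dist(vec, scaled[leader]) > radius for leader in leaders):
            leaders.append(name)
    return leaders


# Hypothetical metric vectors: (IPC, L2 misses per 1000 instructions).
metrics = {
    "sort": [0.9, 12.0],
    "grep": [1.0, 11.0],
    "kmeans": [1.6, 3.0],
    "pagerank": [0.6, 25.0],
}
print(select_representatives(metrics, radius=0.3))
```

With these example values, grep is dropped because it behaves almost
like sort, while kmeans and pagerank are kept. When new workloads
appear, the selection can be re-run on the extended metric table;
whether such a simple scheme preserves representativeness is part of the
open question noted above.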

4.5 Generating Realistic Mixed Workloads 

With the fast development of big data systems and applications,
a diverse mix of workloads share a common computing
infrastructure in modern cloud data centers. These
workloads have different system behaviors and input data
sizes (e.g., ranging from KB to PB), and their behaviors
also heavily rely on the underlying software stacks such as
Hadoop, Spark, Hive and Impala. Moreover, their dynamic
arrival patterns, including request/job arrival rates and
sequences, are equally important workload features
to be considered. As big data systems such as Hadoop
mature, the pressure to benchmark and understand these
mixed workloads rises.

To generate realistic mixed workloads such that trustworthy
benchmarking reflecting practical data center
scenarios can be conducted, three major challenges should
be addressed. First, it is difficult to use synthetic workloads
alone to emulate the behaviors of workloads with highly
diverse features in terms of workload types, input sizes and
software stacks. Hence it is necessary to generate benchmark
results using actual workloads. Second, we believe profiling
history logs of real applications, namely actual workload
traces, is a good way to obtain realistic arrival patterns. How
to use this profiling information to guide the generation
of actual workloads is still an open question. Finally, it is
challenging to produce workloads at different scales to meet
the requirements of different benchmarking scenarios, while
still keeping the realistic mix of big data workloads. To the best
of our knowledge, none of the existing big data benchmarks has
solved all of the above challenges.
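
As a minimal sketch of trace-guided generation, the code below derives
inter-arrival gaps and the workload mix from a small trace and samples a
new job schedule from them, with a scale factor to raise or lower the
load. The trace contents, the function name replay_schedule and its
parameters are hypothetical; real traces would be parsed from history
logs.

```python
import random

# Hypothetical trace: (submit time in seconds, workload type).
trace = [
    (0.0, "hive_query"), (2.0, "spark_ml"), (3.0, "hive_query"),
    (9.0, "hadoop_sort"), (10.0, "hive_query"),
]


def replay_schedule(trace, num_jobs, rate_scale=1.0, seed=0):
    """Sample a job schedule that reuses the trace's inter-arrival gaps
    and workload mix; rate_scale > 1 shortens gaps (higher load)."""
    rng = random.Random(seed)
    times = [t for t, _ in trace]
    gaps = [b - a for a, b in zip(times, times[1:])]
    kinds = [kind for _, kind in trace]
    clock, schedule = 0.0, []
    for _ in range(num_jobs):
        schedule.append((round(clock, 3), rng.choice(kinds)))
        clock += rng.choice(gaps) / rate_scale
    return schedule


for submit_time, kind in replay_schedule(trace, num_jobs=5, rate_scale=2.0):
    print(f"t={submit_time:6.2f}s  submit {kind}")
```

Sampling gaps and workload types independently keeps the overall arrival
rate and mix but discards any ordering between consecutive jobs, so it
addresses only part of the arrival-sequence aspect discussed above.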

4.6 Benchmarking Emerging In-memory Computing 
Systems 

As the demand for low-latency computation increases,
many researchers now pay more attention to how to use
memory more efficiently to improve the performance of data
processing. A typical example of this in-memory
computing paradigm is Spark, a big data platform that
exploits memory locality and intermediate result caching to
speed up big data processing. At present, the Spark community
has built an ecosystem to support various application
domains including machine learning, SQL queries, graph
computation, streaming applications, and the R language.

The features of in-memory computing systems give rise
to two major challenges in benchmarking them. First, memory
has a larger impact on system performance than other
resources, hence the efficiency with which workloads use
memory resources significantly impacts benchmarking results.
However, this memory usage is difficult to control when
benchmarking some in-memory computing systems. For
example, Spark relies on the JVM for memory management, and
this makes it difficult to monitor the actual memory usage during
benchmarking. Second, in-memory computing systems such
as Spark apply compression and serialization techniques to
reduce memory usage at the cost of significantly increased
CPU utilization. In this case, how to measure the CPU
usage and other low-level performance metrics is another
challenge to be addressed.
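
The trade-off between memory footprint and CPU time can be observed
with a small experiment. The sketch below uses Python's standard pickle
and zlib modules as stand-ins; it does not use Spark's own serializers
or compression codecs, and the synthetic records and the function name
measure are assumptions made for illustration.

```python
import pickle
import time
import zlib


def measure(records, compress_level=None):
    """Return (size in bytes of the encoded data, CPU seconds spent)."""
    start = time.process_time()
    blob = pickle.dumps(records)
    if compress_level is not None:
        blob = zlib.compress(blob, compress_level)
    return len(blob), time.process_time() - start


# Synthetic, repetitive records stand in for cached intermediate results.
records = [("user%05d" % (i % 100), i % 7, "click") for i in range(200_000)]

for level in (None, 1, 9):
    size, cpu = measure(records, level)
    label = "no compression" if level is None else f"zlib level {level}"
    print(f"{label:>15}: {size / 1e6:6.2f} MB, {cpu * 1e3:7.1f} ms CPU")
```

The exact numbers depend on the machine and the data, but the encoded
size shrinks when compression is enabled while the CPU time for encoding
typically grows, which is the effect a benchmark of in-memory systems
needs to capture.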

4.7 Supporting Heterogeneous Hardware Platforms 

With the fast development of technology, emerging hardware
platforms significantly change how data are
processed and show promise for improving
data processing efficiency. For example, the heterogeneous
platforms of Xeon+General-purpose computing on graphics
processing units (GPGPU) and Xeon+Many Integrated Core
(MIC) can significantly improve the processing speed of
HPC applications. However, to date, both platforms are
limited to the HPC area; that is, the diversity of big data
applications is not fully considered in these platforms.
To address this issue, big data benchmarks should be developed
to evaluate and compare workloads on state-of-the-practice
heterogeneous platforms. The evaluation results are
expected to show: (1) whether any platform can consistently
win in terms of both performance and energy efficiency for
all big data applications, and (2) for each class of big data
applications, which specific platform
achieves better performance and energy efficiency.
To support the evaluation of an application, current big
data benchmarks should be extended to provide a uniform
interface that enables the application to run on different
platforms. In order to perform apples-to-apples comparisons,
the application should also run on the same software
stack.
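
A uniform interface of this kind could look like the following sketch,
in which the same workload and input are handed to every platform
through one method. All class and function names are hypothetical, the
"accelerator" platform is only simulated by chunking the input, and
only run time is recorded; measuring energy would need platform-specific
instrumentation that is not shown.

```python
import time
from abc import ABC, abstractmethod


class Platform(ABC):
    """Hypothetical uniform interface that a benchmark could target."""

    name = "abstract"

    @abstractmethod
    def run(self, workload, data):
        """Execute `workload` on `data` and return its result."""


class HostOnly(Platform):
    name = "host-only"

    def run(self, workload, data):
        return workload(data)


class SimulatedOffload(Platform):
    """Stand-in for a host+accelerator back end. It only splits the input
    into chunks to mimic offloading; no real device is involved."""

    name = "host+accelerator (simulated)"

    def __init__(self, chunk_size=10_000):
        self.chunk_size = chunk_size

    def run(self, workload, data):
        step = self.chunk_size
        return sum(workload(data[i:i + step]) for i in range(0, len(data), step))


def compare(platforms, workload, data):
    """Run the same workload and data on every platform."""
    report = {}
    for platform in platforms:
        start = time.perf_counter()
        result = platform.run(workload, data)
        report[platform.name] = (result, time.perf_counter() - start)
    return report


def sum_of_squares(values):  # an additive workload, so chunking is valid
    return sum(v * v for v in values)


platforms = [HostOnly(), SimulatedOffload()]
report = compare(platforms, sum_of_squares, list(range(100_000)))
for name, (result, seconds) in report.items():
    print(f"{name}: result={result}, time={seconds:.4f}s")
```

Because every platform receives the same workload function and the same
data, differences in the report come from the platform implementation
alone, which is the precondition for the comparison described above.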

5 Conclusion 

With the rapid development of information technology, big
data systems have emerged to manage and process data
with high requirements of volume, velocity, variety and
veracity. These emerging systems have given rise to various
new challenges in developing a new generation
of benchmarks. In this paper, we summarize the lessons
we have learned and the challenges in developing big data
benchmarks, mainly from two aspects: (1) how to develop
data generators capable of preserving the 4V properties of
big data; (2) how to implement application-specific workloads
while still covering a diversity of typical application
scenarios and supporting different system implementations
and software stacks. Following these two aspects, we review
existing benchmarking techniques for big data systems and
present some future research directions. The work presented
in this paper represents our effort towards building a truly
representative, comprehensive and cost-efficient big data
benchmark, and we encourage more investigation and
development to make this a reality.